Variable Metadata Var Class_2026 Var Class_2023
1 company TRUE character <NA>
2 abi FALSE numeric numeric
3 year_established TRUE numeric numeric
4 primary_naics6_code TRUE numeric numeric
5 naics6_descriptions TRUE character character
6 subsidiary_number TRUE character <NA>
7 company_holding_status TRUE numeric <NA>
8 address_line_1 FALSE character character
9 city FALSE character character
10 state FALSE character character
11 zipcode FALSE character numeric
12 zip4 FALSE character <NA>
13 primary_sic_code TRUE numeric <NA>
14 sic6_descriptions TRUE character <NA>
15 sic_code TRUE numeric <NA>
16 sic6_descriptions_sic TRUE character <NA>
17 sic_code_1 TRUE numeric <NA>
18 sic6_descriptions_sic1 TRUE character <NA>
19 sic_code_2 TRUE numeric <NA>
20 sic6_descriptions_sic2 TRUE character <NA>
21 sic_code_3 TRUE numeric <NA>
22 sic6_descriptions_sic3 TRUE character <NA>
23 sic_code_4 TRUE logical <NA>
24 sic6_descriptions_sic4 TRUE logical <NA>
25 latitude FALSE numeric numeric
26 longitude FALSE numeric numeric
27 census_block FALSE numeric <NA>
28 census_tract FALSE character <NA>
29 county_code FALSE character <NA>
30 fips_code FALSE character <NA>
31 cbsa_level FALSE numeric <NA>
32 cbsa_code FALSE character <NA>
33 csa_code FALSE character <NA>
34 area_code FALSE numeric <NA>
35 site_number TRUE character <NA>
36 yellow_page_code TRUE numeric <NA>
37 office_size_code TRUE logical <NA>
38 employee_size_location TRUE numeric <NA>
39 location_employee_size_code TRUE character <NA>
40 sales_volume_location TRUE numeric <NA>
41 location_sales_volume_code TRUE character <NA>
42 parent_number TRUE character <NA>
43 parent_actual_employee_size TRUE numeric <NA>
44 parent_employee_size_code TRUE character <NA>
45 parent_actual_sales_volume TRUE numeric <NA>
46 parent_sales_volume_code TRUE character <NA>
47 primary_naics2_code TRUE numeric numeric
48 address_type_indicator TRUE character <NA>
49 industry_specific_first_byte TRUE character <NA>
50 business_status_code TRUE numeric <NA>
51 population_code TRUE numeric <NA>
52 match_code TRUE character <NA>
53 ticker TRUE logical <NA>
54 idcode TRUE numeric <NA>
Revise Process to Support the Updated Format
Last Updated: May 19th, 2026
In May 2026, an updated version of the raw data was provided in a different format than the version exported in July 2023 and provided in the summer of 2025. As a result, data processing was split into two paths: one for the 2023 format and one for the 2026 format.
The 2026 versions of comparable sections should be treated as the most current method applicable for ongoing research. Insights from the 2023 format are assumed to apply to the 2026 format as well. Unexpected outcomes or shortcomings in algorithm performance identified in the 2023 drafts were used to inform improvements applied in the 2026 version.
The purpose of this page is to detail the differences between the 2023 and 2026 formats and provide a framework for handling the additional metadata. The insights detailed in Review - 2023 Format are assumed to translate to this format, as the original data source is the same. In addition to the new metadata, the data is also structured in a long format, where additional years are represented as rows rather than columns. This report details how these nuances impact the data transformation and how that information is captured when translated into a years-by-column format.
Selected code excerpts are provided below. The complete code used to generate this report can be found in Code/Process Data Update.R. Functions referenced can be found in Code/Support Functions/For Processing New Data.R.
Raw Data Dictionary
File name: church_long_form_050926.csv
Number of rows: 10,201,197
Number of columns: 54
company:The business name.archive_version_year:The year the information was added, spanning 2000 to 2025.abi:American Business Identifier (ABI): The unique identifier for each business, used to associate addresses with a specific organization.year_established:Indicates the year the business was first established.primary_naics8_code:This dataset combines the six-digit 2022 vintage North American Industry Classification System (NAICS) code and a two-digit proprietary code from Data Axle. The two-digit Data Axle code provides additional classification details not reflected here.naics8_descriptions:All entries have NAICS = 813110, indicating they are classified as “Religious Organizations”, which includes churches, shrines, monasteries (except schools), synagogues, mosques, and temples.subsidiary_number:A unique identifier assigned to a subsidiary of the parent company, used to link the business location back to its specific subsidiary entity within the broader corporate hierarchy.company_holding_status:Indicates the holding or ownership status of the company, distinguishing between classifications such as headquarters, branch, subsidiary, or standalone entity within a corporate structure.address_line_1:The primary address line associated with the ABI entry.city:The city associated with the address.state:The state associated with the address.zipcode:The five-digit ZIP code associated with the address.zip4:The four-digit ZIP code extension associated with the address.primary_sic_code:The primary Standard Industrial Classification (SIC) code assigned to the business, representing its main industry or business activity.sic6_descriptions:Description of the SIC code.sic_code_[1-4]:Up to four additional SIC codes assigned to the business, capturing additional or secondary industries and activities beyond the primary classification.sic6_descriptions_sic[1-4]:Description of the addditional SIC codes.longitude and latitude:Given as two columns. Provides the geolocation associated with the address listed in that row.census_block:The smallest geographic unit defined by the U.S. Census Bureau. Provides a precise sub-tract level location identifier associated with the address listed in that row.census_tract:A small, relatively permanent geographic subdivision of a county defined by the U.S. Census Bureau, used for statistical and demographic analysis.county_code:A numeric code identifying the county associated with the address listed in that row.fips_code:The Federal Information Processing Standards (FIPS) code assigned to the geographic area, providing a standardized identifier for states, counties, and other administrative regions.cbsa_level:Indicates the level of the Core-Based Statistical Area (CBSA), distinguishing between Metropolitan Statistical Areas (MSAs) and Micropolitan Statistical Areas.cbsa_code:A numeric code identifying the Core-Based Statistical Area (CBSA) associated with the address, as defined by the U.S. Office of Management and Budget (OMB).csa_code:A numeric code identifying the Combined Statistical Area (CSA) associated with the address, representing a grouping of adjacent CBSAs with strong economic and social ties.area_code:The telephone area code associated with the business location.site_number:A unique identifier assigned to a specific physical location or site of a business, used to distinguish individual locations within the same organization.yellow_page_code:A code indicating the Yellow Pages heading(s) under which the business is listed, reflecting how the business advertises or categorizes itself in directory listings.office_size_code:A coded value representing the size of the office or physical location.employee_size_location and location_employee_size_code:Given as two columns. Provides the actual employee count at the specific location and a corresponding code representing that count within a size classification range.sales_volume_location and location_sales_volume_code:Given as two columns. Provides the actual sales attributed to the specific location and a corresponding code representing that number within a size classification range.parent_number:A unique identifier assigned to the parent company of the business, used to link individual locations or subsidiaries back to their controlling organization.parent_actual_employee_size and parent_employee_size_code:Given as two columns. Provides the total employee count for the parent company and a corresponding code representing that count within a size classification range.parent_actual_sales_volume and parent_sales_volume_code:Given as two columns. Provides the total sales for the parent company and a corresponding code representing that number within a size classification range.Unknown variables or code translations:address_type_indicator,industry_specific_first_byte,business_status_code,population_code,match_code, ticker, andidcode.
Correspondance to the 2023 Format
Core fields (abi, address_line_1, city, and state) are consistent between the old and new data. Not shown here, the 2026 format introduces a variable archive_version_year that appends new years of information as rows, whereas the 2023 format extended this information column-wise as binary columns. Notably, the new data stores the zip code as a five-digit character field, preserving flanking zeros, and includes a four-digit zip code extension.
The new dataset also introduces additional columns covering census and location attributes, as well as business-level details including size, status, and SIC encoding. Because the data structure appends new information row-wise, care must be taken to ensure these metadata fields are correctly collapsed when the data is transformed into single rows per ABI and address. Ideally, the structural transformation should occur after addresses have been grouped and matched to similar ones, to reduce redundancies introduced by typographical errors.
Variables can be categorized as follows:
Variable Group
1 company Core business details
2 abi Core business details
3 year_established Core business details
4 primary_naics6_code Core business details
5 naics6_descriptions Core business details
6 subsidiary_number Core business details
7 company_holding_status Core business details
8 address_line_1 Core business details by location
9 city Core business details by location
10 state Core business details by location
11 zipcode Core business details by location
12 zip4 Core business details by location
13 primary_sic_code Core business details by location
14 sic6_descriptions Core business details by location
15 sic_code Core business details by location
16 sic6_descriptions_sic Core business details by location
17 sic_code_1 Core business details by location
18 sic6_descriptions_sic1 Core business details by location
19 sic_code_2 Core business details by location
20 sic6_descriptions_sic2 Core business details by location
21 sic_code_3 Core business details by location
22 sic6_descriptions_sic3 Core business details by location
23 sic_code_4 Core business details by location
24 sic6_descriptions_sic4 Core business details by location
25 latitude Core location details
26 longitude Core location details
27 census_block Core location details
28 census_tract Core location details
29 county_code Core location details
30 fips_code Core location details
31 cbsa_level Core location details
32 cbsa_code Core location details
33 csa_code Core location details
34 area_code Core location details
35 site_number Core business details by location
36 yellow_page_code Core business details by location
37 office_size_code Core business details by location
38 employee_size_location Core business details by location
39 location_employee_size_code Core business details by location
40 sales_volume_location Core business details by location
41 location_sales_volume_code Core business details by location
42 parent_number Parent company details
43 parent_actual_employee_size Parent company details
44 parent_employee_size_code Parent company details
45 parent_actual_sales_volume Parent company details
46 parent_sales_volume_code Parent company details
47 primary_naics2_code Unknown variables or code translations
48 address_type_indicator Unknown variables or code translations
49 industry_specific_first_byte Unknown variables or code translations
50 business_status_code Unknown variables or code translations
51 population_code Unknown variables or code translations
52 match_code Unknown variables or code translations
53 ticker Unknown variables or code translations
54 idcode Unknown variables or code translations
By this classification, there are are two types of metadata:
Metadata relevant to a given business or location that does not change across years (e.g.
year_established,fips_code)Metadata that may change across years (e.g.
census_block,employee_size_location)
This analysis requires only most of the core business and location details. The remaining metadata will be processed simultaneously to produce a final, cleaned dataset, though not with any specific follow-up analysis in mind. As a result, the metadata may not be in a format suitable for all analyses, as the data are primarily prepared for the dashboard.
Metadata are processed by considering change across location (distinguishing between values consistent for a given business and those that vary by location) and year. For both types of location variation, value changes by year are captured to maintain granularity. While these are not expected to indicate anything actionable, they may reveal opportunities for standardization or data cleaning and validation. Some of these opportunities will be taken here, while the remainder will be left for follow-up analyses.
To capture year-level granularity in value changes, a date range will be added to each unique value when collapsed by abi and combined_address (a formatted combination of all address components). For fields that are relevant to this project, unique values that do not span the expected years — the years in which that abi and address were recorded — will be cleaned to report only the distinct value.
In the following sections, specific metadata fields are assessed for quality and to develop this condensing approach.
NAICS and SIC Encodings
NAICS and SIC encodings provide additional context about the type of business, religious affiliation, and function. The 2023 form of the data only included the NAICS encoding, while the 2026 form includes both.
Primary NAICS and SIC Encodings
When processing the 2023 data, it was identified that the NAICS code is eight digits long, with the trailing two digits representing an unknown proprietary code from Data Axle. These will be separated into a six-digit (naics6) code with description and a two-digit (naics2) code with no associated description.
Both versions of the data share the same unique NAICS eight-digit encodings. The 2023 data processing protocol retained only the unique six-digit NAICS codes and their descriptions, which are consistent across all entries; the trailing two digits were stored as a string. The method devised to retain applicable years will also be applied to this field.
SIC refines the classification of religious organizations by capturing attributes such as affiliation, denomination, and operational function (e.g., school, retreat center). This information was not provided in the previous version. It is provided as both a primary classification column accompanied by up to four additional overflow columns. All are examined together in the following section.
Primary and Additional SIC Encodings
The following assumptions about the SIC encodings require review and confirmation:
Overflow columns are filled progressively with no other discernible pattern.
The same categories appear across all columns, with the possible exception of the primary SIC column, which may differ or be restricted.
Each code maps to a unique description and vice versa.
Nomenclature is consistent throughout.
The following results apply to the most recent data export in the 2026 format at the time of this writing. To ensure subsequently exported data in this format are appropriately processed, they will need to be reaudited to confirm the validity of these assumptions or flag any corrections that are required.
Show the code
# SIC column auditing function devised to quickly assess the above assumptions
# for recently exported data in the 2026 format.
sic_overflow_audit <- function(df,
sic_code_cols = c("primary_sic_code","sic_code","sic_code_1","sic_code_2","sic_code_3","sic_code_4"),
sic_desc_cols = c("sic6_descriptions","sic6_descriptions_sic","sic6_descriptions_sic1",
"sic6_descriptions_sic2","sic6_descriptions_sic3","sic6_descriptions_sic4"),
id_cols = "abi") {
#' @description
#' Audits a dataset that stores SIC classifications in a primary column plus
#' multiple “overflow” columns (e.g., `sic_code_1` ... `sic_code_4`) with a
#' corresponding set of description columns. The function:
#' (1) runs an overflow-pattern check on codes and descriptions,
#' (2) identifies rows where the description overflow pattern fails (often due to
#' missing descriptions),
#' (3) quantifies overlap in (code, desc) pairs across sources,
#' (4) builds a unified (sic_code -> sic_desc) mapping table and checks one-to-one
#' consistency, and
#' (5) identifies SIC codes that appear sometimes with NA descriptions and sometimes
#' with non-NA descriptions.
#'
#' @param df A data.frame/tibble containing SIC code and description columns.
#' @param sic_code_cols Character vector of SIC code column names, ordered from
#' primary to most-overflowed.
#' @param sic_desc_cols Character vector of SIC description column names aligned
#' 1:1 with `sic_code_cols`.
#' @param id_cols Optional character vector of identifier columns to retain when
#' returning problematic rows (default: `"abi"`). Only columns present in `df`
#' are used.
#'
#' @details
#' This function assumes you have an `overflow_check(df, cols)` helper available
#' in your environment that returns a data frame with (at least) an `overflow_ok`
#' logical column describing whether each row follows the expected overflow pattern.
#'
#' @return A named list containing:
#' \describe{
#' \item{sic_code_cols_used}{Aligned SIC code columns actually used (present in `df`).}
#' \item{sic_desc_cols_used}{Aligned SIC description columns actually used (present in `df`).}
#' \item{overflow_checked_code}{Row-level output of `overflow_check()` for code columns.}
#' \item{overflow_checked_desc}{Row-level output of `overflow_check()` for description columns.}
#' \item{overflow_summary_code}{Counts of rows that follow the progressive overflow pattern (OK vs not OK) for codes.}
#' \item{overflow_summary_desc}{Counts of rows that follow the progressive overflow pattern (OK vs not OK) for descriptions.}
#' \item{missing_desc_rows}{Subset of rows where description overflow check fails.}
#' \item{missing_desc_code}{Unique SIC codes that appear with NA descriptions among failing rows.}
#' \item{missing_desc_codes_are_subset_of_no_desc}{Logical check: confirm if all SIC codes correspond to NA description.}
#' \item{presence_wide}{Unique (sic_code, sic_desc) pairs widened to show presence by column.}
#' \item{presence_tabs}{Tabulations of presence (TRUE/FALSE/NA) by source column.}
#' \item{map_tbl}{Unified unique mapping of non-NA (sic_code, sic_desc) pairs.}
#' \item{n_unique_pairs}{Number of unique non-NA pairs in `map_tbl`.}
#' \item{code_to_desc_consistent}{TRUE if each sic_code maps to exactly one sic_desc.}
#' \item{desc_to_code_consistent}{TRUE if each sic_desc maps to exactly one sic_code.}
#' \item{codes_with_multiple_desc}{SIC codes mapping to != 1 description (if any).}
#' \item{desc_with_multiple_codes}{Descriptions mapping to != 1 code (if any).}
#' \item{codes_with_mixed_na_desc}{Non-NA mappings for codes that are NA somewhere else.}
#' }
#'
#' @export
# ---- Basic input validation ----
stopifnot(
is.data.frame(df),
is.character(sic_code_cols),
is.character(sic_desc_cols),
length(sic_code_cols) == length(sic_desc_cols)
)
# ---- Align code/desc pairs to columns that actually exist in df ----
# Keep only those indices where BOTH the code col and the paired desc col exist.
code_present <- intersect(sic_code_cols, names(df))
desc_present <- intersect(sic_desc_cols, names(df))
idx_code <- match(code_present, sic_code_cols)
idx_desc <- match(desc_present, sic_desc_cols)
keep_idx <- intersect(idx_code, idx_desc)
sic_code_cols <- sic_code_cols[keep_idx]
sic_desc_cols <- sic_desc_cols[keep_idx]
# Keep only ID columns that exist.
id_cols <- intersect(id_cols, names(df))
# ---- Overflow checks (requires overflow_check() in your environment) ----
overflow_checked_code <- overflow_check(df, sic_code_cols)
overflow_checked_desc <- overflow_check(df, sic_desc_cols)
# Summaries: how many rows follow the expected pattern vs not.
overflow_summary_code <- overflow_checked_code %>%
dplyr::summarise(
n_rows = dplyr::n(),
n_ok = sum(overflow_ok, na.rm = TRUE),
n_bad = sum(!overflow_ok, na.rm = TRUE)
)
overflow_summary_desc <- overflow_checked_desc %>%
dplyr::summarise(
n_rows = dplyr::n(),
n_ok = sum(overflow_ok, na.rm = TRUE),
n_bad = sum(!overflow_ok, na.rm = TRUE)
)
# Rows where the *description* overflow pattern fails
# (often because a code exists but its paired description is missing).
missing_desc <- overflow_checked_desc %>%
dplyr::filter(overflow_ok == FALSE) %>%
dplyr::select(dplyr::any_of(c(id_cols, sic_code_cols, sic_desc_cols)))
# ---- Presence/overlap across sources (code+desc pairs) ----
# Each "source" is a column pairing such as "sic_code_1 + sic6_descriptions_sic1"
all_sources <- paste0(sic_code_cols, " + ", sic_desc_cols)
# Long table of unique pairs per source (keeping NA on either side so we can see gaps).
long <- purrr::map2_dfr(sic_code_cols, sic_desc_cols, \(cc, dc) {
df %>%
dplyr::transmute(
sic_code = as.character(.data[[cc]]),
sic_desc = as.character(.data[[dc]])
) %>%
dplyr::filter(!is.na(sic_code) | !is.na(sic_desc)) %>%
dplyr::distinct() %>%
dplyr::mutate(
source = paste0(cc, " + ", dc),
present = TRUE
)
})
# Widen to one row per unique (sic_code, sic_desc), with TRUE/FALSE presence by source.
presence_wide <- long %>%
dplyr::mutate(source = factor(source, levels = all_sources)) %>%
tidyr::pivot_wider(
names_from = source,
values_from = present,
values_fill = FALSE
)
# Tabulate presence by the *source columns that actually exist* in presence_wide.
# (Some expected sources may not appear if there were zero observations.)
source_cols_present <- intersect(all_sources, names(presence_wide))
presence_tabs <- lapply(source_cols_present, \(nm) table(presence_wide[[nm]], useNA = "ifany"))
presence_tabs <- do.call(rbind, presence_tabs)
rownames(presence_tabs) <- source_cols_present
# ---- SIC codes that are missing descriptions (robust to any number of pairs) ----
# Build all (sic_code, sic_desc) pairs from the failing rows and capture codes with NA desc.
missing_desc_code <- character(0)
if (nrow(missing_desc) > 0 && length(sic_code_cols) > 0) {
miss_long <- purrr::map2_dfr(sic_code_cols, sic_desc_cols, \(cc, dc) {
missing_desc %>%
dplyr::transmute(
sic_code = as.character(.data[[cc]]),
sic_desc = as.character(.data[[dc]])
)
})
missing_desc_code <- miss_long %>%
dplyr::filter(is.na(sic_desc), !is.na(sic_code)) %>%
dplyr::distinct(sic_code) %>%
dplyr::pull(sic_code)
}
# All SIC codes paired with NA descriptions anywhere in the presence table.
no_desc <- presence_wide %>%
dplyr::filter(is.na(sic_desc)) %>%
dplyr::pull(sic_code)
# Sanity check: codes flagged via missing_desc should be a subset of the no-desc codes.
missing_desc_codes_are_subset_of_no_desc <- all(missing_desc_code %in% no_desc)
# ---- Build a unified mapping table and run consistency checks ----
# Only non-NA pairs go into the mapping table.
map_tbl <- purrr::map2_dfr(sic_code_cols, sic_desc_cols, \(cc, dc) {
df %>%
dplyr::transmute(
sic_code = as.character(.data[[cc]]),
sic_desc = as.character(.data[[dc]])
) %>%
dplyr::filter(!is.na(sic_code), !is.na(sic_desc)) %>%
dplyr::distinct()
}) %>%
dplyr::distinct()
# a) Each sic_code should map to exactly one sic_desc
code_to_desc <- map_tbl %>%
dplyr::count(sic_code, name = "n_desc") %>%
dplyr::filter(n_desc != 1)
# b) Each sic_desc should map to exactly one sic_code
desc_to_code <- map_tbl %>%
dplyr::count(sic_desc, name = "n_code") %>%
dplyr::filter(n_code != 1)
# c) SIC codes that appear with NA descriptions in some source and non-NA in another.
desc_that_is_sometimes_na <- map_tbl %>%
dplyr::filter(sic_code %in% no_desc)
# ---- Return a single list output (all artifacts + checks) ----
list(
sic_code_cols_used = sic_code_cols,
sic_desc_cols_used = sic_desc_cols,
overflow_checked_code = overflow_checked_code,
overflow_checked_desc = overflow_checked_desc,
overflow_summary_code = overflow_summary_code,
overflow_summary_desc = overflow_summary_desc,
missing_desc_rows = missing_desc,
missing_desc_code = missing_desc_code,
missing_desc_codes_are_subset_of_no_desc = missing_desc_codes_are_subset_of_no_desc,
presence_wide = presence_wide,
presence_tabs = presence_tabs,
map_tbl = map_tbl,
n_unique_pairs = nrow(map_tbl),
code_to_desc_consistent = (nrow(code_to_desc) == 0),
desc_to_code_consistent = (nrow(desc_to_code) == 0),
codes_with_multiple_desc = code_to_desc,
desc_with_multiple_codes = desc_to_code,
desc_that_is_sometimes_na = desc_that_is_sometimes_na
)
}ASSUMPTION A: Each additional SIC encoding is assumed to represent supplementary classification information attributed to that address. The code columns follow the expected pattern; however, the description columns do not. This is likely due to missing descriptions for certain codes. The sic_results$missing_desc_codes_are_subset_of_no_desc = TRUE result confirms that description fields deviating from this pattern do so because of missing descriptions for a given code.
sic_results$overflow_summary_code
#sic_results$overflow_checked_code# A tibble: 1 × 3
n_rows n_ok n_bad
<int> <int> <int>
1 10201197 10201197 0
sic_results$overflow_summary_desc
#sic_results$overflow_checked_desc# A tibble: 1 × 3
n_rows n_ok n_bad
<int> <int> <int>
1 10201197 10201098 99
# Confirm that description deviations from the expected overflow pattern
# are associated with codes that have no corresponding description.
#sic_results$missing_desc_rows
#sic_results$missing_desc_code
sic_results$missing_desc_codes_are_subset_of_no_descASSUMPTION B: If each column represents an overflow of the same underlying information, a significant degree of overlap in the values present across columns would be expected.
# All SIC codes/descriptions
sic_results$presence_wide
# Tabulated outcomes
sic_results$presence_tabs row FALSE TRUE
1 primary_sic_code + sic6_descriptions 2563 43
2 sic_code + sic6_descriptions_sic 353 2253
3 sic_code_1 + sic6_descriptions_sic1 1149 1457
4 sic_code_2 + sic6_descriptions_sic2 1652 954
5 sic_code_3 + sic6_descriptions_sic3 1984 622
There is a high degree of overlap across the overflow SIC encodings, which decreases as the overflow index increases. This is expected, as fewer entries carry more than one or two additional SIC classifications.
Only a small subset of possible SIC codes is utilized in the primary column, suggesting that the available selection may be more restricted at the point of data collection. All 43 outcomes in the primary SIC are also used in overflow SIC columns.
# Pull columns that are logical TRUE/FALSE “sic presence” columns
sic_cols <- sic_results$presence_wide |>
select(where(is.logical)) |>
names()
other_cols <- setdiff(sic_cols, "primary_sic_code + sic6_descriptions")
# All Primary SIC outcomes are present in the overflow categories
sic_results$presence_wide %>%
filter(.data[["primary_sic_code + sic6_descriptions"]] & if_any(all_of(other_cols), ~ .x)) %>%
nrow()ASSUMPTION C: It is assumed that each code maps uniquely to a single description and vice versa. To verify this assumption, a mapping table is constructed from all non-null (code, description) pairs across all column pairs. Consistency checks:
Each sic_code maps to exactly one description.
Each description maps to exactly one sic_code.
sic_codes associated with an NA description are identified and accounted for.
sic_results$map_tbl
list(
n_unique_pairs = sic_results$n_unique_pairs,
code_to_desc_consistent = sic_results$code_to_desc_consistent,
desc_to_code_consistent = sic_results$desc_to_code_consistent,
codes_with_multiple_desc = sic_results$codes_with_multiple_desc,
desc_with_multiple_codes = sic_results$desc_with_multiple_codes,
desc_that_is_sometimes_na = sic_results$desc_that_is_sometimes_na
)$n_unique_pairs
[1] 2583
$code_to_desc_consistent
[1] FALSE
$desc_to_code_consistent
[1] FALSE
$codes_with_multiple_desc
# A tibble: 18 × 2
sic_code n_desc
<chr> <int>
1 279610 2
2 513705 2
3 549903 2
4 561101 2
5 562101 2
6 596316 2
7 799922 2
8 799927 2
9 804908 2
10 822298 2
11 824401 2
12 824903 2
13 835102 2
14 864102 2
15 869904 2
16 874881 2
17 912101 2
18 912104 2
$desc_with_multiple_codes
# A tibble: 2 × 2
sic_desc n_code
<chr> <int>
1 COMPUTER SOFTWARE 2
2 SCHOOLS-BUSINESS & SECRETARIAL 2
$desc_that_is_sometimes_na
# A tibble: 3 × 2
sic_code sic_desc
<chr> <chr>
1 866124 MONASTERIES
2 821120 SCHOOL DISTRICTS
3 737206 COMPUTER SOFTWARE
Some codes are associated with multiple descriptions and vice versa. Additionally, some codes are sometimes associated with an NA description despite having a valid description in other instances. Each case will be addressed individually.
ASSUMPTION D: Many of these inconsistencies are attributable to variations in nomenclature and will be corrected on an individual basis.
# Entries to correct description
fix_desc <- sic_results$presence_wide %>%
filter(sic_code %in% sic_results$codes_with_multiple_desc$sic_code) %>%
select(sic_code, sic_desc) %>%
arrange(sic_code) %>%
group_by(sic_code) %>%
mutate(dup_id = row_number()) %>%
ungroup() %>%
pivot_wider(
id_cols = sic_code,
names_from = dup_id,
values_from = sic_desc,
names_prefix = "sic_desc_"
) %>%
as.data.frame() sic_code sic_desc
1 279610 PRINTERS SERVICES (MFRS)
2 279610 PRINTERS' SERVICES (MFRS)
3 513705 WOMEN'S APPAREL-WHOLESALE
4 513705 WOMENS APPAREL-WHOLESALE
5 549903 WATER COMPANIES-BOTTLED, BULK, ETC
6 549903 WATER COMPANIES-BOTTLED/BULK & ETC
7 561101 MEN'S CLOTHING & FURNISHINGS-RETAIL
8 561101 MENS CLOTHING & FURNISHINGS-RETAIL
9 562101 WOMEN'S APPAREL-RETAIL
10 562101 WOMENS APPAREL-RETAIL
11 596316 COFFEE & FOOD SVC-MOBILE
12 596316 MOBILE CONCESSIONS
13 799922 NATIONAL LANDMARK
14 799922 NATIONAL LANDMARK/HISTORIC SITE/MEMORIAL
15 799927 CHILDREN'S SVCS & ACTIVITIES INFORMATION
16 799927 CHILDRENS SVCS & ACTIVITIES INFORMATION
17 804908 NURSES & NURSES REGISTRIES
18 804908 NURSES & NURSES' REGISTRIES
19 822298 JUNIOR COLLEGES & TECHNICAL INSTITUTES
20 822298 JUNIOR-COMMUNITY COLLEGE-TECH INSTITUTES
21 824401 SCHOOLS-BUSINESS & SECRETARIAL
22 824401 SCHOOLS-BUSINESS & VOCATIONAL
23 824903 SCHOOLS-INDUSTRIAL TECHNICAL & TRADE
24 824903 SCHOOLS-TRADE & VOCATIONAL
25 835102 SCHOOLS-NURSERY & KINDERGARTEN ACADEMIC
26 835102 SCHOOLS-PRE-SCHOOL/KINDERGARTEN-ACADEMIC
27 864102 VETERANS & MILITARY ORGANIZATIONS
28 864102 VETERANS' & MILITARY ORGANIZATIONS
29 869904 WOMEN'S ORGANIZATIONS & SERVICES
30 869904 WOMENS ORGANIZATIONS & SERVICES
31 874881 EDUCATIONAL ASSESSEMENT
32 874881 EDUCATIONAL ASSESSMENT
33 912101 GOVERNMENT OFFICES-FEDERAL
34 912101 GOVERNMENT OFFICES-US
35 912104 GOVERNMENT OFFICES-CITY, VILLAGE & TWP
36 912104 GOVERNMENT OFFICES-CITY/VILLAGE & TWP
# Entries to correct code
fix_code <- sic_results$presence_wide %>%
semi_join(
sic_results$presence_wide %>%
filter(sic_desc %in% sic_results$desc_with_multiple_codes$sic_desc) %>%
distinct(sic_code),
by = "sic_code"
)# A tibble: 6 × 7
sic_code sic_desc primary_sic_code + s…¹ sic_code + sic6_desc…²
<chr> <chr> <lgl> <lgl>
1 573401 COMPUTER SOFTWARE FALSE TRUE
2 737206 <NA> FALSE TRUE
3 737206 COMPUTER SOFTWARE FALSE TRUE
4 824401 SCHOOLS-BUSINESS & SEC… FALSE TRUE
5 824401 SCHOOLS-BUSINESS & VOC… FALSE TRUE
6 824402 SCHOOLS-BUSINESS & SEC… FALSE TRUE
# ℹ abbreviated names: ¹`primary_sic_code + sic6_descriptions`,
# ²`sic_code + sic6_descriptions_sic`
# ℹ 3 more variables: `sic_code_1 + sic6_descriptions_sic1` <lgl>,
# `sic_code_2 + sic6_descriptions_sic2` <lgl>,
# `sic_code_3 + sic6_descriptions_sic3` <lgl>
# Entries to correct NA
fix_na <- sic_results$presence_wide %>%
semi_join(
sic_results$presence_wide %>%
filter(sic_desc %in% sic_results$desc_that_is_sometimes_na$sic_desc) %>%
distinct(sic_code),
by = "sic_code"
)# A tibble: 7 × 7
sic_code sic_desc primary_sic_code + sic6_de…¹ sic_code + sic6_desc…²
<chr> <chr> <lgl> <lgl>
1 573401 COMPUTER SOFTWARE FALSE TRUE
2 737206 <NA> FALSE TRUE
3 737206 COMPUTER SOFTWARE FALSE TRUE
4 821120 <NA> FALSE TRUE
5 821120 SCHOOL DISTRICTS FALSE TRUE
6 866124 <NA> TRUE TRUE
7 866124 MONASTERIES TRUE TRUE
# ℹ abbreviated names: ¹`primary_sic_code + sic6_descriptions`,
# ²`sic_code + sic6_descriptions_sic`
# ℹ 3 more variables: `sic_code_1 + sic6_descriptions_sic1` <lgl>,
# `sic_code_2 + sic6_descriptions_sic2` <lgl>,
# `sic_code_3 + sic6_descriptions_sic3` <lgl>
These SIC code characteristics are expected to vary with additional years of data or subsequent exports of previous reports, as Data Axle may update their database over time. It will therefore be important to process these columns from a raw data export prior to beginning any analysis or data validation.
All unique SIC values for a given ABI and address will be retained and expanded as overflow columns. The primary encoding will be constrained to ensure it remains consistent with qualifying classifiers, while all additional SIC columns will represent any other unique SIC values present.
Handling Additional Metadata
In the previous section, two encodings were examined: NAICS and SIC. NAICS remained uninformative, while SIC provided additional important outlet details (described above). The remaining metadata fall into two handling categories:
Metadata not directly relevant to the current project. These fields will be retained with no effort made to clean or validate the values.
Metadata supplying additional location details, such as county code, census block, and FIPS codes. These fields will be assessed for validation in subsequent steps while being handled similarly to category (a). Note that attributed census block and related fields may not translate directly to specific decennial maps, potentially necessitating reference to an external database.
As described above, metadata can vary by location and across the years a business was recorded. As no cleaning or validation effort is being applied to category (a), both location and year-level variation will be retained. The same approach will apply to category (b), with the addition of Boolean validation checks to capture:
Whether the same location outcome spans the complete expected date range.
Whether multiple location outcomes are associated with the same address.
Whether multiple identified location outcomes vary by decennial period or follow alternative patterns (source)
Show the code
# ---- Functions to Construct the Unique Outcomes with Dates String ----
summarize_code_ranges <- function(df, code_col, abi_col = "abi", address_col = "combined_address", year_col = "archive_version_year") {
#' @description
#' Summarize contiguous year ranges for a code by ABI + address (data.table)
#'
#' For each ABI/address pair, this creates a single string describing each
#' distinct `code_col` value and the contiguous year ranges in which it appears,
#' e.g. `"A (2018-2020), B (2022)"`.
#'
#' Rows are deduplicated at the ABI/address/year/code level before computing
#' runs. ABI/address pairs with no non-missing `code_col`+`year_col` observations
#' are retained and return `NA` in the output column.
#'
#' @param df A data.frame/data.table containing ABI, address, code, and year fields.
#' @param code_col Character scalar. Column name of the code/value to summarize.
#' @param abi_col Character scalar. Column name for ABI. Default `"abi"`.
#' @param address_col Character scalar. Column name for address. Default `"combined_address"`.
#' @param year_col Character scalar. Column name for the year field. Default `"archive_version_year"`.
#'
#' @return A data.frame with columns `{abi_col}`, `{address_col}`, and
#' `{paste0(code_col, "_ranges")}`.
#' @export
stopifnot(is.character(code_col), length(code_col) == 1)
out_col <- paste0(code_col, "_ranges")
# Convert to data.table (no copy if df already is one)
DT <- data.table::as.data.table(df)
# Build the ABI/address key set we always want to keep in the output
keys <- unique(
DT[!is.na(get(abi_col)) & !is.na(get(address_col)),
.(abi = get(abi_col), address = get(address_col))]
)
# ---- Overall "available" contiguous year ranges per ABI/address (ignores code_col) ----
Y <- DT[
!is.na(get(abi_col)) &
!is.na(get(address_col)) &
!is.na(get(year_col)),
.(
abi = get(abi_col),
address = get(address_col),
.year = suppressWarnings(as.integer(get(year_col)))
)
][!is.na(.year)] # drop rows where year can't be parsed
if (nrow(Y) == 0L) {
# No valid years anywhere: everyone gets NA for availability
avail <- keys[, .(available_year_ranges = NA_character_)]
} else {
# Sort + de-duplicate years within ABI/address
data.table::setorder(Y, abi, address, .year)
Y <- unique(Y, by = c("abi", "address", ".year"))
# Mark runs of consecutive years (new run when year != previous + 1)
Y[, .run := cumsum(.year != data.table::shift(.year, type = "lag", fill = .year[1L]) + 1L),
by = .(abi, address)]
# Summarize each run to start/end
avail_runs <- Y[, .(start_year = min(.year), end_year = max(.year)),
by = .(abi, address, .run)]
# Format run as "YYYY" or "YYYY-YYYY"
avail_runs[, date_range := data.table::fifelse(
start_year == end_year,
as.character(start_year),
paste0(start_year, "-", end_year)
)]
# Collapse all runs to one string per ABI/address
data.table::setorder(avail_runs, abi, address, start_year)
avail <- avail_runs[, .(available_year_ranges = paste(date_range, collapse = ", ")),
by = .(abi, address)]
}
# ---- Code-specific contiguous year ranges per ABI/address/code ----
X <- DT[
!is.na(get(abi_col)) &
!is.na(get(address_col)) &
!is.na(get(code_col)) &
!is.na(get(year_col)),
.(
abi = get(abi_col),
address = get(address_col),
.year = suppressWarnings(as.integer(get(year_col))),
.code = as.character(get(code_col))
)
][!is.na(.year)]
if (nrow(X) == 0L) {
# No valid code+year rows: return keys + availability + NA code ranges
out <- merge(keys, avail, by = c("abi", "address"), all.x = TRUE, sort = FALSE)
out[, (out_col) := NA_character_]
} else {
# Sort + de-duplicate at ABI/address/year/code
data.table::setorder(X, abi, address, .code, .year)
X <- unique(X, by = c("abi", "address", ".year", ".code"))
# Mark runs of consecutive years within each ABI/address/code
X[, .run := cumsum(.year != data.table::shift(.year, type = "lag", fill = .year[1L]) + 1L),
by = .(abi, address, .code)]
# Summarize each run to start/end
runs <- X[, .(start_year = min(.year), end_year = max(.year)),
by = .(abi, address, .code, .run)]
# Format run as "YYYY" or "YYYY-YYYY"
runs[, date_range := data.table::fifelse(
start_year == end_year,
as.character(start_year),
paste0(start_year, "-", end_year)
)]
# Create "CODE (range)" items and collapse per ABI/address
data.table::setorder(runs, abi, address, start_year, .code)
runs[, .item := paste0(.code, " (", date_range, ")")]
res <- runs[, .(val = paste(.item, collapse = ", ")), by = .(abi, address)]
data.table::setnames(res, "val", out_col)
# Merge everything back to full key set
out <- merge(keys, avail, by = c("abi", "address"), all.x = TRUE, sort = FALSE)
out <- merge(out, res, by = c("abi", "address"), all.x = TRUE, sort = FALSE)
# Normalize empty strings to NA
out[, (out_col) := data.table::fifelse(get(out_col) == "", NA_character_, get(out_col))]
}
# Restore original column names
data.table::setnames(out, c("abi", "address"), c(abi_col, address_col))
# Put available_year_ranges right after combined_address (address_col)
data.table::setcolorder(
out,
c(
abi_col, address_col, "available_year_ranges",
setdiff(names(out), c(abi_col, address_col, "available_year_ranges"))
)
)
as.data.frame(out)
}
summarize_many_code_ranges_dt <- function(df, vars, abi_col = "abi", address_col = "combined_address", year_col = "archive_version_year") {
#' @description
#' Summarize contiguous year ranges for many code columns (with progress bar)
#'
#' Calls `summarize_code_ranges()` for each variable in `vars` and merges results by ABI/address.
#' A console progress bar is displayed using `utils::txtProgressBar()`.
#'
#' @param df A data.frame/data.table containing ABI, address, and year fields.
#' @param vars Character vector of column names to summarize (each becomes `*_ranges`).
#' @param abi_col Character scalar. Column name for ABI. Default `"abi"`.
#' @param address_col Character scalar. Column name for address. Default `"combined_address"`.
#' @param year_col Character scalar. Column name for the year field. Default `"archive_version_year"`.
#'
#' @return A data.frame with one row per ABI/address and one `*_ranges` column per
#' element of `vars`.
#' @export
stopifnot(is.character(vars), length(vars) >= 1L)
DT <- as.data.table(df)
# Compute the base output ONCE (includes available_year_ranges)
base <- summarize_code_ranges(
df,
code_col = vars[[1]],
abi_col = abi_col,
address_col = address_col,
year_col = year_col
)
# Convert base to data.table and standardize key names for merging
out <- as.data.table(base)
setnames(out, c(abi_col, address_col), c("abi", "address"))
pb <- utils::txtProgressBar(min = 0, max = length(vars), style = 3)
on.exit(close(pb), add = TRUE)
for (i in seq_along(vars)) {
v <- vars[[i]]
newcol <- paste0(v, "_ranges")
tmp <- summarize_code_ranges(
df,
code_col = v,
abi_col = abi_col,
address_col = address_col,
year_col = year_col
)
tmpDT <- as.data.table(tmp)
setnames(tmpDT, c(abi_col, address_col), c("abi", "address"))
# Drop availability column(s) from tmp (we keep the one from base)
tmpDT[, grep("^available_year_ranges(\\.|$)", names(tmpDT), value = TRUE) := NULL]
# If out already has this *_ranges column, don't merge a second copy
if (newcol %in% names(out)) {
tmpDT[, (newcol) := NULL]
}
# Merge once (your original code merged twice, which guarantees duplicates)
out <- merge(out, tmpDT, by = c("abi", "address"), all.x = TRUE, sort = FALSE)
utils::setTxtProgressBar(pb, i)
}
# Restore original ABI/address column names
setnames(out, c("abi", "address"), c(abi_col, address_col))
# Ensure available_year_ranges is right after the address column
setcolorder(
out,
c(
abi_col, address_col, "available_year_ranges",
setdiff(names(out), c(abi_col, address_col, "available_year_ranges"))
)
)
as.data.frame(out)
}
# ---- Functions to Quality Check Compressed Location Information ----
avail_years <- function(x) {
#' @description
#' Expand an "available years" string into a sorted unique vector of years.
#' Parses strings like `"2002-2005, 2007, 2010-2012"` into an integer vector.
#' Used by [coverage_checks()] (and therefore indirectly by [check_ranges_same_outcome()]).
#'
#' @param x Character scalar. Comma-separated list of years or year ranges in the form
#' `"YYYY"` or `"YYYY-YYYY"`. Whitespace is allowed.
#'
#' @return Integer vector of years (may be length 0 if `x` is empty/NA/unparseable).
# Treat NA/blank as no years
if (is.na(x) || !nzchar(trimws(x))) return(integer(0))
# Split into tokens on commas (allow whitespace around commas)
parts <- stringr::str_split(x, "\\s*,\\s*")[[1]]
# Convert each token to years
yrs <- purrr::map(parts, function(p) {
# Match YYYY or YYYY-YYYY (with optional whitespace)
m <- stringr::str_match(p, "^\\s*(\\d{4})(?:\\s*-\\s*(\\d{4}))?\\s*$")
# Ignore unparseable token
if (anyNA(m)) return(integer(0))
# Start year
s <- as.integer(m[2])
# End year (or start if missing)
e <- as.integer(dplyr::if_else(is.na(m[3]), m[2], m[3]))
# Expand inclusive range
seq.int(s, e)
})
# Flatten, unique, sort
sort(unique(unlist(yrs)))
}
normalize_year <- function(y) {
#' @description
#' Normalize a year token into a 4-digit integer year and keeps digits only. If
#' more than 4 digits are present, uses the *last 4* digits (e.g., `"20011"` ->
#' `2011`). Used by [year_value_map()] (and therefore indirectly by
#' [coverage_checks()] and [check_ranges_same_outcome()]).
#'
#' @param y A value coercible to character containing a year.
#'
#' @return Integer year (e.g., `2011`) or `NA_integer_` if no digits are found.
# Keep digits only
y <- gsub("\\D", "", as.character(y))
# No digits => NA
if (!nzchar(y)) return(NA_integer_)
# If too long, keep last 4 digits (e.g., 20011 -> 2011)
if (nchar(y) > 4) y <- substr(y, nchar(y) - 3, nchar(y))
# Coerce to integer year
as.integer(y)
}
year_value_map <- function(x) {
#' @description
#' Parse a ranges string into a year -> value map (and detect conflicting overlaps).
#' Parses `"A (2002), B (2003-2005)"` into a mapping from each year to a value label.
#' Overlapping years with different labels are returned as conflicts.
#' Used by [coverage_checks()] (and therefore indirectly by [check_ranges_same_outcome()]).
#'
#' @param x Character scalar like `"0 (2002), 1 (2011-2016), 2 (2021-2022)"`.
#'
#' @return A list with:
#' \itemize{
#' \item `map`: named character vector (names are years, values are labels).
#' \item `conflict_years`: integer vector of years that have >1 distinct label.
#' }
# Empty input => empty mapping
if (is.na(x) || !nzchar(trimws(x))) {
return(list(
map = setNames(character(0), character(0)),
conflict_years = integer(0)
))
}
# Extract "<val> (YYYY[-YYYY])" segments
m <- stringr::str_match_all(
x,
"(?:^|,\\s*)([^,()]+?)\\s*\\(\\s*(\\d{4,})(?:\\s*-\\s*(\\d{4,}))?\\s*\\)"
)[[1]]
# No matches => empty mapping
if (nrow(m) == 0) {
return(list(
map = setNames(character(0), character(0)),
conflict_years = integer(0)
))
}
# Segment labels
vals <- stringr::str_trim(m[, 2])
# Start years (normalized)
starts <- vapply(m[, 3], normalize_year, integer(1))
# End years (or start if missing), normalized
ends <- vapply(ifelse(is.na(m[, 4]), m[, 3], m[, 4]), normalize_year, integer(1))
# Drop unparseable segments
keep <- !is.na(starts) & !is.na(ends)
vals <- vals[keep]
starts <- starts[keep]
ends <- ends[keep]
# All segments dropped => empty mapping
if (length(vals) == 0) {
return(list(
map = setNames(character(0), character(0)),
conflict_years = integer(0)
))
}
# Expand to one row per year
long <- purrr::map2_dfr(seq_along(vals), vals, function(i, v) {
# Ignore inverted ranges like "2016-2011"
if (ends[i] < starts[i]) {
return(tibble::tibble(year = integer(0), value = character(0)))
}
# Inclusive expansion
tibble::tibble(year = seq.int(starts[i], ends[i]), value = v)
})
# Nothing expanded => empty mapping
if (nrow(long) == 0) {
return(list(
map = setNames(character(0), character(0)),
conflict_years = integer(0)
))
}
# Years with >1 distinct label (overlaps)
conflict_years <- long |>
dplyr::group_by(year) |>
dplyr::summarise(nv = dplyr::n_distinct(value), .groups = "drop") |>
dplyr::filter(nv > 1) |>
dplyr::pull(year)
# First label per year (stable map for downstream checks)
map_first <- long |>
dplyr::group_by(year) |>
dplyr::summarise(value = dplyr::first(value), .groups = "drop")
# Named vector year -> value + conflicts
list(
map = setNames(map_first$value, as.character(map_first$year)),
conflict_years = conflict_years
)
}
compress_years <- function(years) {
#' @description
#' Compress a set of years into "start:end" runs.
#' `c(2003,2004,2005,2010,2012,2013)` -> `"2003:2005, 2010, 2012:2013"`.
#' Used by [coverage_checks()] (and therefore indirectly by [check_ranges_same_outcome()]).
#'
#' @param years Integer (or coercible) vector of years.
#'
#' @return Character scalar. Empty string if `years` is empty.
# Normalize + sort + unique
years <- sort(unique(as.integer(years)))
# Drop NA
years <- years[!is.na(years)]
# Nothing to format
if (length(years) == 0) return("")
# New run when gap != 1
breaks <- c(TRUE, diff(years) != 1L)
# Run id per year
grp <- cumsum(breaks)
# Convert each run to "start:end" (or single year)
parts <- vapply(split(years, grp), function(v) {
# Single year
if (length(v) == 1) return(as.character(v[1]))
# Consecutive run
paste0(v[1], ":", v[length(v)])
}, character(1))
# Join runs
paste(parts, collapse = ", ")
}
coverage_checks <- function(avail_str, ranges_str) {
#' @description
#' Compute coverage/uniqueness and "decade consistency" checks for one row.
#' Used by [check_ranges_same_outcome()] to evaluate each `(available_year_ranges, *_ranges)`
#' pair row-by-row.
#'
#' Returns `ok` (single value across all available years) and `decade_ok` (only when `ok` is FALSE:
#' within each decade, values are consistent; gaps are ignored).
#'
#' @param avail_str Character scalar, e.g. `"2002-2022"`.
#' @param ranges_str Character scalar, e.g. `"0 (2002), 1 (2011-2016), 2 (2021-2022)"`.
#'
#' @return Named list: `ok`, `decade_ok`, `missing_years`, `values_seen`, `conflict_years`.
# Expand availability to explicit years
yrs <- avail_years(avail_str)
# Can't check without available years
if (length(yrs) == 0) {
return(list(
ok = NA,
decade_ok = NA,
missing_years = NA_character_,
values_seen = NA_character_,
conflict_years = NA_character_
))
}
# Parse mapping + conflicts
parsed <- year_value_map(ranges_str)
# Named vector year -> value
m <- parsed$map
# Years as character for matching names()
avail_chr <- as.character(yrs)
# Missing available years
missing <- setdiff(avail_chr, names(m))
# Available years that are present in the mapping
present_years <- intersect(avail_chr, names(m))
# Values for those present available years
present_vals <- unname(m[present_years])
# Drop empty values defensively
present_vals <- present_vals[nzchar(present_vals)]
# Distinct values observed across available years
uniq_vals_all <- unique(present_vals)
# Overall pass/fail: full coverage, no conflicts, and exactly one unique value
ok <- (length(missing) == 0) &&
(length(parsed$conflict_years) == 0) &&
(length(present_vals) == length(avail_chr)) &&
(length(uniq_vals_all) == 1)
# Decade check only when ok is FALSE
if (isTRUE(ok)) {
# Not needed when overall check passes
decade_ok <- NA
} else {
# Fail if nothing is present at all
if (length(present_years) == 0) {
decade_ok <- FALSE
} else {
# Convert years to decade buckets (e.g., 2011 -> 2010)
yy <- as.integer(present_years)
d0 <- (yy %/% 10L) * 10L
# Within each decade, require <= 1 distinct value (ignore gaps)
decade_ok <- tibble::tibble(decade = d0, value = present_vals) |>
dplyr::group_by(decade) |>
dplyr::summarise(nv = dplyr::n_distinct(value), .groups = "drop") |>
dplyr::summarise(all_one = all(nv <= 1L), .groups = "drop") |>
dplyr::pull(all_one) |>
dplyr::first(default = FALSE)
}
}
# Return checks + formatted diagnostics
list(
ok = ok,
decade_ok = decade_ok,
missing_years = compress_years(missing),
values_seen = if (length(uniq_vals_all) == 0) "" else paste(sort(uniq_vals_all), collapse = ", "),
conflict_years = compress_years(parsed$conflict_years)
)
}
check_ranges_same_outcome <- function(df,
available_col = "available_year_ranges",
ranges_suffix = "_ranges") {
#' @description
#' Check all `*_ranges` columns against an availability column.
#' Main entry point. For every column ending in `ranges_suffix`, computes:
#' `{col}_ok`, `{col}_decade_ok`, `{col}_missing_years`, `{col}_values_seen`,
#' `{col}_conflict_years`.
#'
#' Internally, it calls [coverage_checks()] for each row/column pair.
#'
#' @param df A data.frame/tibble containing an availability column and one or more `*_ranges` columns.
#' @param available_col Name of the column containing available year ranges.
#' @param ranges_suffix Suffix used to identify the range columns (default `"_ranges"`).
#'
#' @return `df` with additional result columns appended.
stopifnot(available_col %in% names(df))
# Range columns in ORIGINAL order (from df)
range_cols <- names(df)[stringr::str_ends(names(df), stringr::fixed(ranges_suffix))]
range_cols <- setdiff(range_cols, available_col)
# Start with original df
out <- df
# Build outputs (without inserting yet) so we can reorder deterministically
created_names <- character(0)
for (col in range_cols) {
base <- stringr::str_remove(col, stringr::fixed(ranges_suffix))
res <- purrr::map2(df[[available_col]], df[[col]], coverage_checks)
nm_ok <- paste0(base, "_ranges_ok")
nm_decade_ok <- paste0(base, "_ranges_decade_ok")
nm_values <- paste0(base, "_values_seen")
nm_missing <- paste0(base, "_missing_years")
nm_conflict <- paste0(base, "_conflict_years")
out[[nm_ok]] <- purrr::map_lgl(res, "ok")
out[[nm_decade_ok]] <- purrr::map(res, "decade_ok") |> unlist(use.names = FALSE)
out[[nm_values]] <- purrr::map_chr(res, "values_seen")
out[[nm_missing]] <- purrr::map_chr(res, "missing_years")
out[[nm_conflict]] <- purrr::map_chr(res, "conflict_years")
created_names <- c(created_names, nm_ok, nm_decade_ok, nm_missing, nm_values, nm_conflict)
}
# Reorder columns so each block of derived columns sits immediately after its *_ranges column
# Keep all non-created columns in their original relative order.
orig <- names(df)
is_created <- names(out) %in% created_names
ordered <- character(0)
for (nm in orig) {
ordered <- c(ordered, nm)
if (nm %in% range_cols) {
base <- stringr::str_remove(nm, stringr::fixed(ranges_suffix))
ordered <- c(
ordered,
paste0(base, "_ranges_ok"),
paste0(base, "_ranges_decade_ok"),
paste0(base, "_values_seen"),
paste0(base, "_missing_years"),
paste0(base, "_conflict_years")
)
}
}
# Append any remaining columns (e.g., if df had other columns added elsewhere)
remaining <- setdiff(names(out), ordered)
out <- out[, c(ordered, remaining), drop = FALSE]
out
}Supplementary Location Metadata
The following illustrates the process of condensing additional location metadata into one line per unique ABI and address combination. Although some of these fields are numeric, they will be coerced to character, as each represents a classification regardless of whether it is encoded as a numeric value. Only longitude and latitude will be retained and treated as numeric.
vars_for_loc <- c(
# Core business location details (8)
"census_block", "census_tract", "county_code", "fips_code", "cbsa_level",
"cbsa_code", "csa_code", "area_code"
)
loc_condensed <- summarize_many_code_ranges_dt(subset, vars_for_loc) %>%
arrange(abi)
#' @description Codebook for the output fields produced by the evaluation. For
#' each variable condensed to unique abi and address combination,
#' additional checks are added.
#'
#' @field _ranges_OK Logical. TRUE when the `*_ranges` field (e.g.,
#' `census_tract_ranges`) assigns a value for every year
#' listed in `available_year_ranges`, has no overlapping year
#' conflicts (a year mapped to more than one distinct value),
#' and all available years map to exactly one unique value
#' overall. NA when `available_year_ranges` is missing or
#' cannot be parsed.
#'
#' @field _ranges_decade_ok Logical or NA. Only evaluated when `*_ranges_ok`
#' is FALSE. TRUE when, within each decade that has
#' any mapped available years, all mapped years in that
#' decade share the same value (i.e., ≤ 1 distinct
#' value per decade). This check ignores gaps (missing
#' years) within decades. NA when `*_ranges_ok` is TRUE
#' or when `available_year_ranges` is missing/unparseable.
#'
#' @field _values_seen Character. Comma-separated list of distinct values
#' observed for available years in the `*_ranges` mapping
#' (restricted to years in `available_year_ranges`). Empty
#' string if none are observed.
#'
#' @field _missing_years Character. Years present in `available_year_ranges`
#' that are not covered by the `*_ranges` mapping. Empty
#' string if none are missing. NA when
#' `available_year_ranges` is missing/unparseable.
#'
#' @field _conflict_years Character. Years that are assigned more than one
#' distinct value due to overlapping segments in the
#' `*_ranges` mapping, compressed as runs in the form
#' `"YYYY:YYYY"` and separated by `", "`. Empty string
#' if no conflicts.
area_code_qc <- check_ranges_same_outcome(loc_condensed) available_year_ranges
1 2003-2004
2 2005-2006
3 2007-2011
4 2012-2013
5 2007-2019, 2023-2025
6 2005-2025
7 2002-2022
8 2004-2025
census_tract_ranges
1 012101 (2003-2004)
2 012102 (2005-2006)
3 012101 (2007-2011)
4 012101 (2012-2013)
5 275510 (2007-2016), 275500 (2017-2019), 275500 (2023-2025)
6 013701 (2005-2023), 013703 (2024-2025)
7 000000 (2002), 001700 (2003-2016), 001701 (2017-2022)
8 004412 (2004-2025)
census_tract_ranges_ok census_tract_ranges_decade_ok census_tract_values_seen
1 TRUE NA 012101
2 TRUE NA 012102
3 TRUE NA 012101
4 TRUE NA 012101
5 FALSE FALSE 275500, 275510
6 FALSE FALSE 013701, 013703
7 FALSE FALSE 000000, 001700, 001701
8 TRUE NA 004412
census_tract_missing_years census_tract_conflict_years
1 <NA> <NA>
2 <NA> <NA>
3 <NA> <NA>
4 <NA> <NA>
5 <NA> <NA>
6 <NA> <NA>
7 <NA> <NA>
8 <NA> <NA>
Looking at the results for the census_tract variable, nearly half of the entries failed the consistency check and did not pass the check confirming that updates were made over decennial periods, as would be expected. This may be correct or could indicate broader data accuracy problems; either way, it points to notable inconsistencies that would be challenging to reconcile given the available results.
This assessment will be incorporated when condensing the full dataset. However, these findings suggest that the additional location values may need to be freshly associated using a trusted database rather than relying on the provided information.
Metadata Not Contributive to Project Goals
The following illustrates the process of condensing additional other metadata into one line per unique ABI and address combination. Note that primary_naics2_code is not represented here, as it will be derived from primary_naics8_code in the main dataset. This field falls under the “Unknown Variables or Code Translations” category.
vars_to_sum <- c(
# Core business details by business ID (4):
"company", "year_established", "subsidiary_number", "company_holding_status",
# Core business details by location and might change by year (7):
"site_number", "yellow_page_code", "office_size_code", "employee_size_location",
"location_employee_size_code", "sales_volume_location", "location_sales_volume_code",
# Parent company details, where some might change by year (5):
"parent_number", "parent_actual_employee_size", "parent_employee_size_code",
"parent_actual_sales_volume", "parent_sales_volume_code",
# Unknown variables or code translations (8):
#primary_naics2_code
"address_type_indicator", "industry_specific_first_byte",
"business_status_code", "population_code", "match_code", "ticker", "idcode"
)
subset_condensed <- summarize_many_code_ranges_dt(subset, vars_to_sum) %>%
arrange(abi) available_year_ranges employee_size_location_ranges
1 2003-2004 4 (2003-2004)
2 2005-2006 4 (2005-2006)
3 2007-2011 3 (2007-2011)
4 2012-2013 3 (2012), 2 (2013)
5 2007-2019, 2023-2025 3 (2007-2019), 3 (2023), 2 (2024-2025)
6 2005-2025 9 (2005-2006), 8 (2007-2025)
7 2002-2022 2 (2003), 4 (2004-2006), 1 (2007-2022)
8 2004-2025 10 (2004-2005), 9 (2006-2009), 3 (2010), 1 (2011-2025)
location_employee_size_code_ranges business_status_code_ranges
1 A (2003-2004) 9 (2003-2004)
2 A (2005-2006) 9 (2005-2006)
3 A (2007-2011) 9 (2007-2011)
4 A (2012-2013) 9 (2012-2013)
5 A (2007-2019), A (2023-2025) 9 (2007-2019), 9 (2023-2025)
6 B (2005-2025) 9 (2005-2025)
7 A (2002-2022) 9 (2003-2022)
8 C (2004-2005), B (2006-2009), A (2010-2025) 9 (2004-2025)
population_code_ranges
1 6 (2003-2004)
2 6 (2005-2006)
3 6 (2007-2011)
4 6 (2012-2013)
5 5 (2007-2015), 7 (2016-2019), 7 (2023-2025)
6 6 (2005-2015), 7 (2016-2025)
7 1 (2002-2015), 5 (2016-2022)
8 8 (2004-2015), 7 (2016-2025)
Concluding Thoughts
This document outlined additional insights and data processing requirements identified in the 2026 data format. A condensing framework is provided to collapse metadata fields that are not relevant to the project or are of questionable quality. This method retains unique values along with the years in which they were recorded, allowing analysts to preserve this information for follow-up data quality checks, validation, and cleaning.
Metadata Handling Summary
Rows will be condensed by unique abi and combined_address. NAICS codes will be split into the six leading digits and a trailing two-digit proprietary code, with the latter lacking a description. The six-digit section is consistent across all entries, while the trailing two digits will be condensed by unique value and observed date range.
SIC codes will be cleaned to ensure consistency and reconcile unexpected missing values or duplications. After condensing, the primary SIC codes will be organized first, followed by any additional overflow values detected across the years the ABI and address are recorded.
Business location will continue to be primarily attributed to the given address and longitude/latitude values. The assessment of the additional provided location fields presented here reveals unexpected inconsistencies that would be difficult or impractical to reconcile. For this reason, it is recommended that decennial location information be associated to each entry by geolocation, as outlined in the 2023 data form processing methods.
In addition to the above, data cleaning insights gleaned from the 2023 format are assumed to apply to this version as well: Review - 2023 Format.
Step 1 Algorithm Updates
The 2026 format begins in a different structure (long vs. wide), providing the opportunity to approach Step 1 of the outlined data processing differently: Step 1 Results - 2023 Format.
Prior to transforming the data from long form (years as rows) to wide form (years as columns), addresses will be assessed for similarity. Identified similar addresses will be applied at the time of metadata condensing to avoid a secondary condensing step on the wide form. Using the prior Step 1 method, duplications were mistakenly introduced and some addresses failed the geolocation quality check.
As a consequence, the following modifications will be implemented to account for these variations at the time of transformation:
Duplications were largely attributed to entries where the same
address_line_1details were associated with different cities and/or zip codes. Entries will be condensed by the compiled address instead ofaddress_line_1alone.The widened version of the ABI subset will undergo column-wise evaluation to ensure each year is only represented once (sum to 1 or 0). Rows for a given ABI that fail this will be marked as duplicated while all else will be marked as passed. Entries associated with an ABI where no duplication was added will be marked as NA.
Before condensing, prospective similar addresses will undergo a longitude/latitude similarity test. Those that fail will not be condensed by similar address, but will instead be condensed by unique value. Addresses that fail this test but share the same
address_line_1value will be marked for override yet still condensed by unique value, allowing for follow-up once addresses have been validated using the USPS API in Step 2.